International Journal of Computer Science Engineering 
and Information Technology Research (IJCSEITR) 
ISSN(P): 2249-6831; ISSN(E): 2249-7943 
Vol. 7, Issue 2, Apr 2017, 17-24 
© TJPRC Pvt. Ltd. 





AN APPROACH TO EMAIL CATEGORIZATION FOR 
TELECOMMUNICATION CORPUS 

RAJWANT KAUR 1 & GAURAV PATHAK 2 

1 Research Scholar, CSE, Chandigarh University, Ajitgarh, India 
2 Assistant Professor, Department of Applied Science, Chandigarh University, Ajitgarh, India 

ABSTRACT 

At present, most transactions and business take place through email, and an email address is now also required to 
log in to most websites. As a result, a large number of emails accumulate in our email accounts, which makes them hard to 
read and manage. This is the motivation for email categorization: classifying emails into categories is a convenient way 
for people to read them. The main aim of this paper is therefore to address the problem of email overload by automatically 
classifying emails into different classes based on their content. A telecommunication industry dataset is used for the 
categorization, and the system classifies the emails into two categories: service and finance. 

1. INTRODUCTION 

With the expansion of networks, email has become an effective, fast and economical form of 
communication. Electronic mail is a method of sending digital messages from one person to another between 
computers via a network [1]. Email remains the most ubiquitous form of communication because of its moderate cost 
and the massive use of the internet. Email is essential for signing up to social media sites, for online shopping, for 
online transactions and for online communication, so the number of email users is continually increasing. According 
to the Radicati Group's report, there are currently 2.6 billion active email users [2], and by the end of 2019 this 
number will grow to over 2.9 billion. The number of email accounts is growing slightly faster than the number of 
email users because users have multiple accounts; the number of email accounts grows by about 7% per year. Along 
with the growth in email messaging, there has also been substantial growth in unsolicited mail. The average number 
of graymail messages received per user is fourteen, which is expected to rise to nineteen by the end of 2019. There 
is therefore a need for email mining. Email mining is needed not only for graymail filtering; it is also important 
for email foldering. Today we send and receive about 90 messages per day, and for some people the number is usually 
more than a hundred. Users therefore spend a lot of their working time processing and organizing these emails. At 
the same time, a large part of email traffic consists of business emails, non-personal emails and emails from 
friends. People struggle to pick out the crucial messages that demand immediate attention. Overload can be tackled 
in two ways: by email summarization and by automatic categorization. With these, it becomes easier to find and 
organize both incoming and existing emails. 

In this paper, Section 2 reviews previous work in email mining. Section 3 presents, in tabular form, the 
algorithms, techniques and datasets used in previous papers. Section 4 presents the results and Section 5 concludes 
the paper. 

2. LITERATURE SURVEY 

Two fast machine learning algorithms, TF-IDF and Naive Bayes (NB), are implemented for email categorization in [3], 
with three categories defined. The two algorithms are compared, and NB gives better results than TF-IDF. 

Klimt et al. [4] presented work on email classification based on relationship data. The experiment is conducted on 
the Enron corpus using the SVM classification algorithm. Parsing is applied to extract the terms of the emails, and 
weights are then assigned using the ltc formula. Evaluation is based on the F1 measure. The CMU dataset, created by the 
authors themselves, is used to check the performance obtained on Enron; the results are almost similar. 
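
As a brief illustration of the ltc weighting mentioned above, and assuming that "ltc" denotes the SMART notation (logarithmic term frequency, inverse document frequency, cosine normalization), the weight of term t in document d can be written as below. The symbols tf, df and N (term frequency, document frequency and number of documents) are introduced here for illustration and are not taken from [4].

```latex
w_{t,d} =
  \frac{\bigl(1 + \log \mathit{tf}_{t,d}\bigr)\,\log\frac{N}{\mathit{df}_t}}
       {\sqrt{\sum_{t'} \Bigl[\bigl(1 + \log \mathit{tf}_{t',d}\bigr)\,\log\frac{N}{\mathit{df}_{t'}}\Bigr]^{2}}}
```

The cosine normalization in the denominator keeps long emails from dominating short ones purely because they contain more words.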

The technical report [5] presents email categorization based on the timeline using different supervised 
learning algorithms. Two large corpora, Enron and SRI, are used for this task. In the preprocessing step, folders 
that contain few messages are deleted. The Wide Margin Winnow algorithm takes less running time compared to the 
other algorithms. It is also observed that Wide Margin Winnow outperforms regular Winnow. 

Xia et al. [6] categorized emails into 15 folders for trouble-free access. Two tournament methods are proposed for 
this task, namely Round Robin Tournament (RRT) and Elimination Tournament (ET). First, both tournament methods are 
compared with the n-way classification method, and the tournament methods give higher accuracy. ET and RRT are then 
compared with each other, and RRT performs slightly better than ET. 

In [7], different classification algorithms, including the J48 decision tree, NB, NN and SVM, are compared for 
spam mail filtering. 

Li and his team [8] introduce the ME model and follow a two-phase approach to categorize emails based on their 
contents and properties. The emails are first preprocessed by filtering out non-character symbols and resolving links. 
In the two-phase method, the first phase classifies the mails into legitimate and spam, and the second phase categorizes 
the emails into 7 categories. For comparison, the ME model is tested against NB, SVM and KNN; the ME model performs best. 

The paper [9] implemented the Evolving Email Clustering Method (EECM), which groups emails based on the 
user's activities. To examine the grouping accuracy of the EECM algorithm, the Davies-Bouldin validity index is used, 
which measures the goodness, quality and validity of a grouping technique. The EECM algorithm is compared with 
K-means and fuzzy clustering on the Enron dataset, and EECM performs better. 
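
For reference, the standard definition of the Davies-Bouldin index is given below. The notation is introduced here for illustration and is not taken from [9]: k is the number of clusters, s_i is the average distance of the members of cluster i to its centroid c_i, and d(c_i, c_j) is the distance between two centroids.

```latex
DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d(c_i, c_j)}
```

Lower values indicate compact, well-separated clusters, so a clustering method with a smaller index is considered the better grouping.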

Lu et al. [10] propose the Semantic Vector Space Model (sVSM) for email categorization. The traditional VSM does 
not capture semantic relations, which is why the authors propose the sVSM method to address this problem. To create the 
semantic vector, features are extracted by considering the hypernymy-hyponymy relations between synonym sets. The 
tf*iwf*iwf algorithm is used to assign the weights of the semantic vector. Three experiments are designed to evaluate 
the performance of this method. In the first experiment, the traditional VSM and sVSM are compared, and the proposed 
method performs well. In the second experiment, the proposed method is compared with the Bayesian and KNN algorithms, 
and KNN gives better results than the others. The third experiment shows that categorization performance increases as 
the email set grows. 

Matwin [11] presented the co-training algorithm to address the problem of unlabeled data, using 1500 emails with 
SVM and Naive Bayes. The experiment is conducted using the Weka tool. Preprocessing is first done by removing stop 
words and by stemming. The problem is divided into highly balanced, medium balanced and balanced cases. SVM gives better 
results than Naive Bayes. The authors also try to find the reason why Naive Bayes gives poorer results than SVM, and so 
they experiment with Naive Bayes by removing features. SVM is also seen to outperform Naive Bayes with very large 
feature sets. 

Yang et al. [12] discuss spam mail in healthcare organizations. To classify emails into spam or ham, common 
characteristics such as no drug effects and disease names are extracted from the Trec dataset. Different machine 
learning algorithms are applied. To improve accuracy, the decision tree and Naive Bayes algorithms are combined; the 
combination gives higher accuracy than the others and also a lower error rate. 

Kumar et al. [13] compare fifteen classification algorithms for the classification of email spam. Soni et 
al. [14] propose an AEMS (Automatic Email Management System) for handling emails. 

Mishra et al. [15] also worked on spam categorization. To carry out this task, the authors use different tools to 
find the best one. Weka performs better compared to RapidMiner and Support Vector Machine. 

Tang et al. [16] present a survey on email mining. The authors do not review a single email mining task only; 
rather, they present five major tasks, namely spam detection, contact analysis, email filing, email visualization 
and email network property analysis. They also mention the related techniques and software tools for mining email. 
Future directions are illustrated with two examples: the email egocentric network and email monetization. 

Many classification algorithms are used to classify emails as legitimate or non-legitimate. The author of [17] 
performs this experiment in real environments to check the performance of these algorithms, collecting email datasets 
from a university, a company and a research institute. The results show that the university dataset has the highest 
percentage of spam messages, due to various subscription services. Decision tree and SVM give the better results. 

Alsmadi et al. [18] carried out email categorization on a personal email dataset. SVM, KNN and N-gram methods 
are developed to achieve clustering and classification of emails. Classification based on N-grams is shown to be the 
best, as the text is bilingual. 


Table 1: Various Techniques, Algorithms for Email Mining from the Era 2002 to 2015 

Paper  Year  Dataset      Techniques   Algorithms 
[3]    2002  RT           Cf           TF-IDF, NB 
[4]    2004  Er, CMU      Cf           SVM 
[5]    2004  Er, SRI      Cf           NB, SVM, ME, WMW 
[6]    2005  RT           Tm           Em, RRT 
[7]    2007  RT           Cf           NN, SVM, NB, J48 
[8]    2007  RT           Cf           ME, KNN, SVM, WMW, NB 
[9]    2009  RT           Cr           EECM, K-means 
[10]   2010  20-ng        Cf           tf*iwf*iwf, sVSM, KNN, NB 
[11]   2011  RT           Cf           CoT, SVM, NB 
[12]   2012  TC           As, Cf, Cr   NB, SVM, J48, Km 
[13]   2012  Sb           Cf           ID3, K-NN, SVM, RF, NB, LDA 
[14]   2013  20-ng        As, Cr, Cf   Ap, non-parametric Km++ 
[15]   2014  Un, Er, SA   Cf           RF, Bg, SVM, NB 
[17]   2015  RT           Cf           SVM, J48, NB 
[18]   2015  RT           Cf, Cr       SVM, Km, NG 

RT-Real time, Er-Enron, ng-newsgroup, TC-Trec Corpus, Sb-Spambase, Un-Usenet, SA-Spam Asian, Cf-Classification, 
Cr-Clustering, As-Association, Tm-Tournament, TF-IDF-Term Frequency-Inverse Document Frequency, 
NB-Naive Bayes, SVM-Support Vector Machine, NN-Neural Network, WMW-Wide Margin Winnow, KNN-k-Nearest 
Neighbor, ME-Maximum Entropy, Em-Elimination, RRT-Round Robin Tournament, Ap-Apriori, EECM-Evolving Email 
Clustering Method, Km-K-mean, CoT-Co-training algorithm, RF-Random Forest, Bg-Bagging, NG-N-Grams 

4. EXPERIMENT 

• Corpus 

In this paper, emails from the telecommunication industry are collected. General information about the email 
dataset is gathered from Google in order to categorize the emails based on their content. Many other public email 
corpora are available, such as Enron, Spambase, Usenet and SRI. However, some of these corpora are used to classify 
spam emails and some are categorized based on the users, so here we build our own dataset. 

• Emails Content Pre-Processing 

A MIME parser is then used to parse information from those emails and build a dataset that contains one record for 
every email, with the following fields: email file name, email body and subject. 
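
The parsing step can be sketched with Python's standard `email` package. This is a minimal illustration rather than the parser used in the experiment: the function name `parse_email`, the record layout and the sample message are assumptions made here.

```python
from email import message_from_string
from email.policy import default


def parse_email(file_name, raw_text):
    """Return one record (file name, subject, body) for a raw MIME message."""
    msg = message_from_string(raw_text, policy=default)
    # Prefer the plain-text part; fall back to an empty body if none exists.
    part = msg.get_body(preferencelist=("plain",))
    body = part.get_content().strip() if part is not None else ""
    return {
        "file_name": file_name,
        "subject": str(msg["subject"] or ""),
        "body": body,
    }


# Hypothetical sample message, used only to exercise the function.
raw = (
    "From: customer@example.com\n"
    "Subject: Refund for postpaid bill\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Please refund the extra charge on my bill.\n"
)
record = parse_email("mail_0001.eml", raw)
print(record)
```

Each email then becomes one dictionary, which maps directly onto the one-record-per-email dataset described above.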

• Emails Content Data Mining 

An automated tool is then used to further analyze the content of all emails and measure the frequency of words. 
More than 20,000 words are collected. Stemming is also applied to the term frequency table. 
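
A rough sketch of building a stemmed term frequency table follows. The paper does not name the tool or the stemmer, so the suffix-stripping function `crude_stem` is a deliberately simple stand-in (not the Porter algorithm), and the sample sentences are invented for illustration.

```python
import re
from collections import Counter

SUFFIXES = ("ing", "ed", "es", "s")  # toy rules; a stand-in for a real stemmer


def crude_stem(word):
    """Strip one common suffix; illustrative only, not a full stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(bodies):
    """Count stemmed word occurrences across all email bodies."""
    counts = Counter()
    for body in bodies:
        tokens = re.findall(r"[a-z]+", body.lower())
        counts.update(crude_stem(tok) for tok in tokens)
    return counts


tf = term_frequencies(
    ["Refunds for billing charges", "Refunded the bills, charged twice"]
)
print(tf.most_common(3))
```

Stemming merges inflected forms such as "refunds" and "refunded" into a single table entry, which keeps the vocabulary smaller than the raw word list.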

• E-Mail Clustering 

We take the entire email as the centroid and divide it into clusters. After the content is divided into clusters, 
each cluster is passed through the knowledge dictionary set for scanning, and we obtain a score for each cluster. 
Finally, we calculate the distance from cluster to cluster and from each cluster to the original content. 
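
The clustering step above can be sketched as follows, under several assumptions that the paper does not state: the email is split into two contiguous chunks, the score of a dictionary word is its relative frequency within a chunk, the score for the whole email is the sum of the cluster scores (which is how the Email column of Table 2 relates to its two cluster columns), and distances are Euclidean. The keyword list and the sample text are hypothetical.

```python
import math
import re

KEYWORDS = ["refund", "billing", "network", "service"]  # hypothetical dictionary


def keyword_scores(tokens):
    """Relative frequency of each dictionary keyword in a token list."""
    total = len(tokens) or 1
    return [tokens.count(word) / total for word in KEYWORDS]


def split_into_clusters(text, k=2):
    """Split the email's tokens into k contiguous chunks (assumed clustering)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    size = math.ceil(len(tokens) / k) or 1
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def euclidean(a, b):
    """Euclidean distance between two equally long score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


email = "refund billing refund pending network service down network"
clusters = split_into_clusters(email)
cluster_scores = [keyword_scores(c) for c in clusters]
# Per-keyword email score taken as the sum of the cluster scores.
email_scores = [sum(col) for col in zip(*cluster_scores)]
cluster_to_cluster = euclidean(cluster_scores[0], cluster_scores[1])
cluster_to_email = [euclidean(s, email_scores) for s in cluster_scores]
print(email_scores, cluster_to_cluster, cluster_to_email)
```

A small cluster-to-email distance suggests that a single cluster already carries most of the email's dictionary words, which is one plausible way to decide which category dominates.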

RESULTS 


Table 2 

Words       Cluster1  Cluster2  Email 
refund      0.67      0.76      1.43 
connection  0.89      0.9       1.79 
waiver      1.1       0.23      1.33 
billing     0.65      1         1.65 
charge      0         0.35      0.35 
network     0.78      0.9       1.68 
prepaid     1.1       0.9       2 
postpaid    0.98      0.99      1.97 
service     0.87      0.92      1.79 
complaint   0.87      0.76      1.63 
issue       0.34      0.54      0.88 

Figure 1: Shows the IDF Values 
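
For readers unfamiliar with IDF, the sketch below shows one common way to compute it. The paper does not state which IDF variant or logarithm base was used, so the unsmoothed natural-log form log(N / df) and the three sample documents are assumptions made for illustration.

```python
import math


def idf(documents):
    """Map each term to log(N / df): N documents, df of them contain the term."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Use a set so that a term is counted at most once per document.
        for term in set(doc.lower().split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


docs = ["refund billing issue", "network issue", "prepaid network issue"]
idf_values = idf(docs)
print(idf_values)
```

A word that occurs in every document, such as "issue" in this toy corpus, receives an IDF of zero, while rarer words receive higher values and therefore discriminate better between categories.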


Table 3: Shows the Accuracy, Precision, and Recall 

PROPERTY                               RESULTS 
True Positive                          955 
True Negative                          10 
False Positive                         15 
False Negative                         0 
Sensitivity (Recall)                   97.94% 
Precision (Positive Predictive Value)  98.45% 
Result Prevalence                      97.50% 
Accuracy                               96.50% 


Precision, Recall and Accuracy are used as the evaluation metrics. 
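
The standard definitions of these metrics can be sketched as below, using the four counts printed in Table 3 as input. The function name `classification_metrics` is introduced here for illustration. With the counts exactly as printed, precision comes out at about 98.45%, matching the table, whereas recall evaluates to 100% (no false negatives are listed) and accuracy to about 98.47%, so the reported recall and accuracy presumably rest on counts or definitions that are not shown.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics, returned as fractions in [0, 1]."""
    total = tp + tn + fp + fn
    return {
        "precision": tp / (tp + fp),      # positive predictive value
        "recall": tp / (tp + fn),         # sensitivity
        "accuracy": (tp + tn) / total,
        "prevalence": (tp + fn) / total,  # share of actual positives
    }


metrics = classification_metrics(tp=955, tn=10, fp=15, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```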


Figure 2: Graphical Representation of Precision, Recall 


Work Flow Diagram 

Figure 3: Shows E-Mail Pre-Processing 

Figure 4: Content Clustering 


CONCLUSIONS 

Email classification in particular utilizes several data mining activities such as text parsing, stemming, 
classification and clustering. There are many goals or reasons for clustering or classifying emails; these may include 
spam detection, contact analysis and email categorization. Results show that our system works well, categorizing 
the emails into relevant folders. 